Backpropagation

Note

Forward propagation computes the outputs.
Backpropagation computes the gradients from the output back to the input using the chain rule.

Forward Propagation

Building a neural network is like stacking Lego bricks:

[figure: a network diagram with input, hidden, and output layers]

We index layers with bracketed superscripts. In the figure above, \([0]\) denotes the input layer, \([1]\) the hidden layer, and \([2]\) the output layer.

\(\mathbf{a}^{[l]}\) denotes the output of layer \(l\), with \(\mathbf{a}^{[0]} = \mathbf{x}\) (the input).

\(\mathbf{z}^{[l]}\) denotes the affine (pre-activation) result of layer \(l\).

\(g^{[l]}\) denotes the activation function of layer \(l\).

Forward propagation is then:

\[\mathbf{z}^{[l]} = \mathbf{W}^{[l]}\mathbf{a}^{[l-1]} + \mathbf{b}^{[l]}\]
\[\mathbf{a}^{[l]} = g^{[l]}(\mathbf{z}^{[l]})\]

where \(\mathbf{W}^{[l]} \in \mathbb{R}^{d^{[l]} \times d^{[l-1]}}\) and \(\mathbf{b}^{[l]} \in \mathbb{R}^{d^{[l]}}\), with \(d^{[l]}\) the width of layer \(l\).
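
The forward pass translates directly into code. Below is a minimal NumPy sketch; the layer sizes and the choice of tanh hidden activation with an identity output are illustrative assumptions, not fixed by the text:

```python
import numpy as np

def forward(x, Ws, bs, activations):
    """Run the forward pass; return the affine results z^[l] and outputs a^[l]."""
    a = x                      # a^[0] = x
    zs, activs = [], [x]
    for W, b, g in zip(Ws, bs, activations):
        z = W @ a + b          # z^[l] = W^[l] a^[l-1] + b^[l]
        a = g(z)               # a^[l] = g^[l](z^[l])
        zs.append(z)
        activs.append(a)
    return zs, activs

# Example: a 3 -> 4 -> 2 network (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
bs = [np.zeros(4), np.zeros(2)]
zs, activs = forward(rng.standard_normal(3), Ws, bs, [np.tanh, lambda z: z])
print(activs[-1])              # network output a^[2]
```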

Preliminaries

1. Suppose the forward pass contains \(\mathbf{x} \to \mathbf{y} \to L\), where \(L \in \mathbb{R}\) is the loss, \(\mathbf{x} \in \mathbb{R}^{n}\), \(\mathbf{y} \in \mathbb{R}^{m}\), and \(\mathbf{y}\) is closer to the output layer than \(\mathbf{x}\). Then:

\[\begin{split} \frac{\partial L}{\partial \mathbf{y}} = \begin{bmatrix} \frac{\partial L}{\partial y_{1}} \\ \vdots \\ \frac{\partial L}{\partial y_{m}} \end{bmatrix} \quad , \quad \frac{\partial L}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial L}{\partial x_{1}} \\ \vdots \\ \frac{\partial L}{\partial x_{n}} \end{bmatrix} \end{split}\]

By the total derivative (the multivariate chain rule):

\[ \frac{\partial L}{\partial x_{k}} = \sum_{j=1}^{m}\frac{\partial L}{\partial y_{j}}\frac{\partial y_{j}}{\partial x_{k}} \]

This gives the relationship between \(\frac{\partial L}{\partial \mathbf{x}}\) and \(\frac{\partial L}{\partial \mathbf{y}}\):

\[\begin{split} \frac{\partial L}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial L}{\partial x_{1}} \\ \vdots \\ \frac{\partial L}{\partial x_{n}} \end{bmatrix} = \begin{bmatrix} \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{1}} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_{1}}{\partial x_{n}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}} \end{bmatrix} \begin{bmatrix} \frac{\partial L}{\partial y_{1}} \\ \vdots \\ \frac{\partial L}{\partial y_{m}} \end{bmatrix} = \left(\frac{\partial \mathbf{y}}{\partial \mathbf{x}}\right)^{T}\frac{\partial L}{\partial \mathbf{y}} \end{split}\]

Here \(\frac{\partial \mathbf{y}}{\partial \mathbf{x}}\) is the Jacobian matrix.
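
The transpose-Jacobian rule is easy to sanity-check numerically. The sketch below uses an arbitrarily chosen map \(\mathbf{y} = \tanh(\mathbf{A}\mathbf{x})\) and loss \(L = \sum_j y_j^2\) (both illustrative assumptions) and compares \((\frac{\partial \mathbf{y}}{\partial \mathbf{x}})^{T}\frac{\partial L}{\partial \mathbf{y}}\) against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))
x = rng.standard_normal(3)

y = lambda x: np.tanh(A @ x)            # an arbitrary differentiable map x -> y
L = lambda x: np.sum(y(x) ** 2)         # an arbitrary scalar loss

dL_dy = 2 * y(x)                        # dL/dy for this loss
J = (1 - np.tanh(A @ x) ** 2)[:, None] * A   # Jacobian dy/dx, shape (m, n)
analytic = J.T @ dL_dy                  # (dy/dx)^T dL/dy

eps = 1e-6
numeric = np.array([(L(x + eps * e) - L(x - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
print(np.max(np.abs(analytic - numeric)))    # should be tiny, ~1e-9
```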

2. The Jacobian of a matrix-vector product, which is easy to verify:

\[\frac{\partial \mathbf{M}\mathbf{x}}{\partial \mathbf{x}}=\mathbf{M}\]
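
A quick numerical check of this identity (the matrix size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((4, 3))
x = rng.standard_normal(3)

eps = 1e-6
# Column k of the Jacobian is the directional derivative along basis vector e_k.
J = np.stack([(M @ (x + eps * e) - M @ (x - eps * e)) / (2 * eps)
              for e in np.eye(3)], axis=1)
print(np.allclose(J, M))   # True
```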

Backpropagation

Recall the gradient descent update:

\[\mathbf{W}^{[l]} = \mathbf{W}^{[l]} - \alpha\frac{\partial{L}}{\partial{\mathbf{W}^{[l]}}}\]
\[\mathbf{b}^{[l]} = \mathbf{b}^{[l]} - \alpha\frac{\partial{L}}{\partial{\mathbf{b}^{[l]}}}\]

We need to compute the gradient of \(L\) with respect to each parameter.

We proceed in three steps:

1. Compute the output-layer gradient \(\frac{\partial L}{\partial \mathbf{z}^{[N]}}\):

\[ \frac{\partial L}{\partial \mathbf{z}^{[N]}} = \left(\frac{\partial \mathbf{a}^{[N]}}{\partial \mathbf{z}^{[N]}}\right)^{T}\frac{\partial L}{\partial \mathbf{a}^{[N]}} \]
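
As a concrete instance (an illustrative choice, not fixed by the text), take the squared-error loss \(L = \frac{1}{2}\lVert \mathbf{a}^{[N]} - \mathbf{y} \rVert^{2}\) against a target \(\mathbf{y}\), with an element-wise sigmoid output \(g^{[N]} = \sigma\). Then \(\frac{\partial L}{\partial \mathbf{a}^{[N]}} = \mathbf{a}^{[N]} - \mathbf{y}\), the Jacobian of \(\sigma\) is diagonal, and the formula above reduces to:

\[\frac{\partial L}{\partial \mathbf{z}^{[N]}} = (\mathbf{a}^{[N]} - \mathbf{y}) \odot \sigma'(\mathbf{z}^{[N]})\]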

2. Compute the hidden-layer gradients \(\frac{\partial L}{\partial \mathbf{z}^{[l]}}\) for \(l = N-1, \dots, 1\):

\[\mathbf{z}^{[l + 1]} = \mathbf{W}^{[l + 1]}\mathbf{a}^{[l]} + \mathbf{b}^{[l + 1]}\]

From the preliminaries above, we know:

\[ \frac{\partial L}{\partial \mathbf{a}^{[l]}} = (\frac{\partial \mathbf{z}^{[l+1]}}{\partial \mathbf{a}^{[l]}})^{T}\frac{\partial L}{\partial \mathbf{z}^{[l+1]}} = (\mathbf{W}^{[l+1]})^{T}\frac{\partial L}{\partial \mathbf{z}^{[l+1]}} \]

Note that the hidden-layer activation \(g^{[l]}\) acts element-wise, so each component of \(\mathbf{a}^{[l]}\) depends only on the corresponding component of \(\mathbf{z}^{[l]}\) (its Jacobian is diagonal). Therefore:

\[\frac{\partial L}{\partial \mathbf{z}^{[l]}} = \frac{\partial L}{\partial \mathbf{a}^{[l]}} \odot {g^{[l]}}'(\mathbf{z}^{[l]})\]

Combining the two:

\[\frac{\partial L}{\partial \mathbf{z}^{[l]}} = \left((\mathbf{W}^{[l+1]})^{T}\frac{\partial L}{\partial \mathbf{z}^{[l+1]}}\right) \odot {g^{[l]}}'(\mathbf{z}^{[l]})\]
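
For instance, with \(g^{[l]} = \tanh\) (an illustrative choice), \({g^{[l]}}'(\mathbf{z}^{[l]}) = 1 - \tanh^{2}(\mathbf{z}^{[l]})\) and the recursion becomes:

\[\frac{\partial L}{\partial \mathbf{z}^{[l]}} = \left((\mathbf{W}^{[l+1]})^{T}\frac{\partial L}{\partial \mathbf{z}^{[l+1]}}\right) \odot \left(1 - \tanh^{2}(\mathbf{z}^{[l]})\right)\]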

3. Compute the parameter gradients \(\frac{\partial L}{\partial \mathbf{W}^{[l]}}\) and \(\frac{\partial L}{\partial \mathbf{b}^{[l]}}\) for \(l = N, \dots, 1\):

\[\mathbf{z}^{[l]} = \mathbf{W}^{[l]}\mathbf{a}^{[l - 1]} + \mathbf{b}^{[l]}\]

Applying the chain rule yields:

\[\frac{\partial L}{\partial \mathbf{W}^{[l]}} = \frac{\partial L}{\partial \mathbf{z}^{[l]}}(\mathbf{a}^{[l - 1]})^{T}\]
\[\frac{\partial L}{\partial \mathbf{b}^{[l]}}=\frac{\partial L}{\partial \mathbf{z}^{[l]}}\]
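
Putting the three steps together, here is a minimal NumPy sketch of one backward pass followed by a gradient descent update. The architecture (tanh hidden layer, sigmoid output) and squared-error loss are illustrative assumptions carried over from the examples above, not the only possible choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, Ws, bs):
    """One forward/backward pass for a tanh-hidden, sigmoid-output network
    with squared-error loss L = 0.5 * ||a^[N] - y||^2 (illustrative choices)."""
    # Forward pass: cache every z^[l] and a^[l].
    activs, zs = [x], []
    a = x
    for l, (W, b) in enumerate(zip(Ws, bs)):
        z = W @ a + b
        a = sigmoid(z) if l == len(Ws) - 1 else np.tanh(z)
        zs.append(z)
        activs.append(a)

    # Step 1: output-layer gradient dL/dz^[N] = (a^[N] - y) * sigma'(z^[N]).
    s = sigmoid(zs[-1])
    delta = (activs[-1] - y) * s * (1 - s)

    grads_W, grads_b = [], []
    for l in reversed(range(len(Ws))):
        # Step 3: parameter gradients at this layer.
        grads_W.insert(0, np.outer(delta, activs[l]))   # dL/dW = delta (a^[l-1])^T
        grads_b.insert(0, delta)                        # dL/db = delta
        # Step 2: propagate delta to the previous (tanh) hidden layer.
        if l > 0:
            delta = (Ws[l].T @ delta) * (1 - np.tanh(zs[l - 1]) ** 2)
    return grads_W, grads_b

# Gradient descent update with learning rate alpha (sizes chosen arbitrarily).
rng = np.random.default_rng(3)
Ws = [rng.standard_normal((4, 3)), rng.standard_normal((1, 4))]
bs = [np.zeros(4), np.zeros(1)]
gW, gb = backprop(rng.standard_normal(3), np.array([1.0]), Ws, bs)
alpha = 0.1
Ws = [W - alpha * g for W, g in zip(Ws, gW)]
bs = [b - alpha * g for b, g in zip(bs, gb)]
```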